Last updated : October 16th, 2022
During this project, I will preprocess two datasets from Yelp: a 7 GB reviews dataset and a 9 GB photos dataset. The preprocessed data will then be fed to an NLP model and a CV model, respectively. This notebook covers only the preprocessing part of the project.
To check the viability of our preprocessing pipeline in production, I will also implement a Yelp API querying algorithm to fetch new data.
#Importing packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
#Setting large figure size for Seaborn
sns.set(rc={'figure.figsize':(11.7,8.27),"font.size":20,"axes.titlesize":20,"axes.labelsize":18})
#Importing Intel extension for sklearn to improve speed
# from sklearnex import patch_sklearn, unpatch_sklearn
# patch_sklearn()
#import cudf
import dill
#64GB of RAM so no need to compress data
business = pd.read_json("Data/yelp_academic_dataset_business.json", lines=True)
# checkins = pd.read_json("Data/yelp_academic_dataset_checkin.json", lines=True)
reviews = pd.read_json("Data/yelp_academic_dataset_review.json", lines=True)
# tips = pd.read_json("Data/yelp_academic_dataset_tip.json", lines=True)
# users = pd.read_json("Data/yelp_academic_dataset_user.json", lines=True)
business.head()
| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | None |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |
Here are the first rows of the reviews dataset:
reviews.head()
| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
Since we are working for a restaurant company, we are only interested in businesses that are restaurants. We will create a list of restaurant businesses and merge it with our reviews dataframe to filter out reviews of other kinds of businesses.
import re
#Compiling the pattern once; any category string containing "Restaurant" matches
restaurant_re = re.compile(r'Restaurant')
def restaurant_select(x):
    return 1 if restaurant_re.search(str(x)) else 0
business["is_restaurant"] = business["categories"].apply(restaurant_select)
#Keeping only restaurants
business = business[business.is_restaurant == 1]
business = business[["business_id"]]
business
| | business_id |
|---|---|
| 3 | MTSW4McQd7CbVtyjqoe9mw |
| 5 | CF33F8-E6oudUQ46HnavjQ |
| 8 | k0hlBqXX-Bt0vf1op7Jr1w |
| 9 | bBDDEgkFA1Otx9Lfe7BZUQ |
| 11 | eEOYSgkmpB90uNA7lDOMRA |
| ... | ... |
| 150325 | l9eLGG9ZKpLJzboZq-9LRQ |
| 150327 | cM6V90ExQD6KMSU3rRB5ZA |
| 150336 | WnT9NIzQgLlILjPT0kEcsQ |
| 150339 | 2O2K6SXPWv56amqxCECd4w |
| 150340 | hn9Toz3s-Ei3uZPt7esExA |
52286 rows × 1 columns
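The category test above can also be done without `apply`, using pandas' vectorized string methods. A sketch on a toy Series (the values are hypothetical; `na=False` handles the `None` categories some businesses have):

```python
import pandas as pd

# Toy stand-in for the real "categories" column (hypothetical values)
categories = pd.Series(["Restaurants, Food, Bubble Tea", "Shipping Centers, Notaries", None])
is_restaurant = categories.str.contains("Restaurant", na=False).astype(int)
print(is_restaurant.tolist())  # → [1, 0, 0]
```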
#Keeping only reviews on restaurants
reviews = pd.merge(reviews, business, on="business_id", how="inner")
#Dropping user_id and business_id
reviews.drop(columns={"user_id", "business_id"}, inplace=True)
reviews.set_index("review_id", inplace=True)
print(reviews.shape)
reviews.head()
(4724684, 6)
| review_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|
| KU_O5udG6zpxOg-VcAEodg | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| VJxlBnJmCDIy8DFG0kjSow | 2 | 0 | 0 | 0 | This is the second time we tried turning point... | 2017-05-13 17:06:55 |
| S6pQZQocMB1WHMjTRbt77A | 4 | 2 | 0 | 1 | The place is cute and the staff was very frien... | 2017-08-08 00:58:18 |
| WqgTKVqWVHDHjnjEsBvUgg | 3 | 0 | 0 | 0 | We came on a Saturday morning after waiting a ... | 2017-11-19 02:20:23 |
| M0wzFFb7pefOPcxeRVbLag | 2 | 0 | 0 | 0 | Mediocre at best. The decor is very nice, and ... | 2017-09-09 17:49:47 |
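The inner merge above acts as a filter: only reviews whose `business_id` appears in the restaurant list survive. A minimal sketch with hypothetical mini frames:

```python
import pandas as pd

# Hypothetical mini versions of the reviews and restaurant-id frames
reviews_toy = pd.DataFrame({"review_id": ["r1", "r2", "r3"],
                            "business_id": ["b1", "b2", "b3"]})
restaurants_toy = pd.DataFrame({"business_id": ["b1", "b3"]})
# The inner merge keeps only rows whose business_id appears in both frames
filtered = pd.merge(reviews_toy, restaurants_toy, on="business_id", how="inner")
print(filtered["review_id"].tolist())  # → ['r1', 'r3']
```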
We will now calculate the length of different comments and look at the distribution of this variable.
reviews["text_length"] = reviews["text"].apply(len)
reviews.head()
| review_id | stars | useful | funny | cool | text | date | text_length |
|---|---|---|---|---|---|---|---|
| KU_O5udG6zpxOg-VcAEodg | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 | 513 |
| VJxlBnJmCDIy8DFG0kjSow | 2 | 0 | 0 | 0 | This is the second time we tried turning point... | 2017-05-13 17:06:55 | 477 |
| S6pQZQocMB1WHMjTRbt77A | 4 | 2 | 0 | 1 | The place is cute and the staff was very frien... | 2017-08-08 00:58:18 | 216 |
| WqgTKVqWVHDHjnjEsBvUgg | 3 | 0 | 0 | 0 | We came on a Saturday morning after waiting a ... | 2017-11-19 02:20:23 | 736 |
| M0wzFFb7pefOPcxeRVbLag | 2 | 0 | 0 | 0 | Mediocre at best. The decor is very nice, and ... | 2017-09-09 17:49:47 | 953 |
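Note that `len` on a string counts characters, not words, so the lengths computed here are character counts. If word counts were wanted instead, a pandas sketch on a hypothetical frame:

```python
import pandas as pd

# Toy frame standing in for the real reviews (hypothetical texts)
toy = pd.DataFrame({"text": ["Great food!", "Slow service but tasty brunch"]})
toy["char_length"] = toy["text"].apply(len)            # characters, as in the notebook
toy["word_count"] = toy["text"].str.split().str.len()  # whitespace-delimited words
print(toy[["char_length", "word_count"]].values.tolist())  # → [[11, 2], [29, 5]]
```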
plt.hist(reviews["text_length"])
plt.title("Histogram of the length of comments")
plt.xlabel("Length (characters)")
plt.ylabel("Number of comments")
plt.show()
#Looking at the distribution of reviews with fewer than 1000 characters
plt.hist(reviews[reviews.text_length < 1000]["text_length"])
plt.title("Histogram of the length of comments with fewer than 1000 characters")
plt.xlabel("Length (characters)")
plt.ylabel("Number of comments")
plt.show()
We can see that most comments are roughly 200 characters long.
#Looking at reviews with fewer than 50 characters
reviews[reviews.text_length < 50]
#Some still carry useful signal (e.g. "the gluten free pizza is unbeatable"), so we will not delete these samples
#We will only delete reviews with fewer than 10 characters
reviews = reviews[reviews.text_length > 10]
reviews[reviews.text.isna()]
#No NA values
reviews.info(verbose=True, show_counts=True)
#reviews = cudf.from_pandas(reviews)
reviews.head()
<class 'pandas.core.frame.DataFrame'>
Index: 4724551 entries, KU_O5udG6zpxOg-VcAEodg to nGLcmo0D3IKrqqgK1kutlA
Data columns (total 7 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   stars        4724551 non-null  int64
 1   useful       4724551 non-null  int64
 2   funny        4724551 non-null  int64
 3   cool         4724551 non-null  int64
 4   text         4724551 non-null  object
 5   date         4724551 non-null  datetime64[ns]
 6   text_length  4724551 non-null  int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 288.4+ MB
| review_id | stars | useful | funny | cool | text | date | text_length |
|---|---|---|---|---|---|---|---|
| KU_O5udG6zpxOg-VcAEodg | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 | 513 |
| VJxlBnJmCDIy8DFG0kjSow | 2 | 0 | 0 | 0 | This is the second time we tried turning point... | 2017-05-13 17:06:55 | 477 |
| S6pQZQocMB1WHMjTRbt77A | 4 | 2 | 0 | 1 | The place is cute and the staff was very frien... | 2017-08-08 00:58:18 | 216 |
| WqgTKVqWVHDHjnjEsBvUgg | 3 | 0 | 0 | 0 | We came on a Saturday morning after waiting a ... | 2017-11-19 02:20:23 | 736 |
| M0wzFFb7pefOPcxeRVbLag | 2 | 0 | 0 | 0 | Mediocre at best. The decor is very nice, and ... | 2017-09-09 17:49:47 | 953 |
After analyzing the length of comments, we will look at their polarity.
Since we are interested only in negative reviews to find out the main topics, we will calculate the polarity of each comment and then keep only negative reviews.
#Using Vader to calculate the polarity of our reviews
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
example = reviews.iloc[3,:]
sentences = example.text.split('.')
#Comparing the polarity computed on the whole review with the average over its sentences
print(analyzer.polarity_scores(example.text)["compound"])
scores = []
for s in sentences:
    scores.append(analyzer.polarity_scores(s)["compound"])
print(np.mean(scores))
#Significant difference !
#Looking at the actual review :
print(sentences)
#Clearly, the review is not that positive and it is even slightly negative
def analyze_polarity(x):
    sentences = str(x).split('.')
    scores = []
    for s in sentences:
        scores.append(analyzer.polarity_scores(s)["compound"])
    return np.mean(scores)
0.8333
0.12201999999999999
['We came on a Saturday morning after waiting a few months after opening hoping that they would resolve the issues from a new restaurant opening', ' We were seated right away and the server brought water, coffee and took our orders right away', ' We waited over 30 mins for breakfast', " I got the freebird and came out first before my husband's dish", ' While it tastes good, it was just potatoes and the spicy sausage gravy was mostly a sauce', ' There was barely any sausage', ' My husband got the ny deli omelette that had way too much cheese that it overpowered everything and very little pastrami', " Lastly, we were ready to go and our server spent at least 10 mins chatting at another table so I couldn't get our check", " I'm not sure if we will return", '']
reviews["polarity"] = reviews["text"].apply(analyze_polarity)
reviews.head()
| review_id | stars | useful | funny | cool | text | date | text_length | polarity |
|---|---|---|---|---|---|---|---|---|
| KU_O5udG6zpxOg-VcAEodg | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 | 513 | 0.230957 |
| VJxlBnJmCDIy8DFG0kjSow | 2 | 0 | 0 | 0 | This is the second time we tried turning point... | 2017-05-13 17:06:55 | 477 | -0.042017 |
| S6pQZQocMB1WHMjTRbt77A | 4 | 2 | 0 | 1 | The place is cute and the staff was very frien... | 2017-08-08 00:58:18 | 216 | 0.371014 |
| WqgTKVqWVHDHjnjEsBvUgg | 3 | 0 | 0 | 0 | We came on a Saturday morning after waiting a ... | 2017-11-19 02:20:23 | 736 | 0.122020 |
| M0wzFFb7pefOPcxeRVbLag | 2 | 0 | 0 | 0 | Mediocre at best. The decor is very nice, and ... | 2017-09-09 17:49:47 | 953 | 0.059770 |
#Dropping the date and review_id columns, which we will not use, to reduce dataframe size
reviews.reset_index(inplace=True)
reviews.drop(columns=["review_id","date"], inplace=True)
#Saving our current reviews file
with open('Data/reviews.pkl', 'wb') as file:
    dill.dump(reviews, file)
with open('Data/reviews.pkl', 'rb') as file:
    reviews = dill.load(file)
Let's look at the distribution of polarity for our reviews:
plt.hist(reviews["polarity"])
plt.title("Histogram of the polarity of reviews")
plt.xlabel("Polarity of Reviews")
plt.ylabel("Number of Reviews")
plt.show()
It is clear that most reviews have a polarity of around 0. Let's now look at the distribution of the number of stars in each review:
plt.hist(reviews["stars"], bins=5)
plt.title("Histogram of the number of stars in Yelp Reviews")
plt.xlabel("Number of Stars")
plt.ylabel("Number of Reviews")
plt.show()
Since we are mainly interested in negative reviews, let's look at the number of stars of negative reviews (with negative polarity) :
#Looking at the star ratings over reviews with less than 0 polarity
plt.hist(reviews[reviews.polarity < 0]["stars"], bins=5)
plt.title("Number of stars of Negative Yelp Reviews")
plt.xlabel("Number of stars")
plt.ylabel("Number of reviews")
#This mostly validates our polarity scoring methodology
plt.show()
This confirms the relevance of our sentiment analysis since most reviews with negative polarity have between 1 and 2 stars.
Since we are only interested in negative reviews for this analysis, we will filter out the reviews with a positive polarity and with more than 2 stars.
#Since we want to keep only topics of dissatisfaction, we will only keep reviews with 1-2 stars with a negative polarity
df = reviews.loc[(reviews.polarity < 0) & (reviews.stars <= 2)].copy()
df["text"] = df["text"].apply(lambda x: str(x).lower())
df.shape
(486008, 8)
Before analyzing our reviews, we need to perform some basic preprocessing of the text data.
We will use spaCy to tokenize and lemmatize the dataset.
This removes stop words and punctuation and shrinks each review, making the reviews easier to analyze.
import spacy
spacy.prefer_gpu()
#The trf model is too slow here, so we use the sm model; this would need to be revisited for production
#nlp = spacy.load("en_core_web_trf")
nlp = spacy.load("en_core_web_sm")
def lemmatize(x):
    doc = nlp(x)
    tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    return ' '.join(tokens)
df["lemma_text"] = df["text"].apply(lemmatize)
#Saving our lemmatized reviews file
with open('Data/lemma_reviews.pkl', 'wb') as file:
    dill.dump(df, file)
with open('Data/lemma_reviews.pkl', 'rb') as file:
    df = dill.load(file)
We will begin by applying manual text vectorization and dimensionality reduction. In a later part, we will show how the BERTopic module can be used to extract topics of interest.
First, we need to turn our text into vectors. We will use the TF-IDF vectorizer, which is quite fast.
from sklearn.feature_extraction.text import TfidfVectorizer
X = df["lemma_text"]
model = TfidfVectorizer(lowercase=True, max_features=1000)
X_tr = model.fit_transform(X)
X_tr.shape
(486008, 1000)
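The behavior of the vectorizer can be checked on a toy corpus of hypothetical lemmatized reviews: each document becomes a sparse row over the shared vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical lemmatized reviews
docs = ["service slow food cold",
        "food delicious service great",
        "wait long table dirty"]
vec = TfidfVectorizer(lowercase=True, max_features=1000)
X_toy = vec.fit_transform(docs)
# One row per document, one column per vocabulary term (10 distinct terms here)
print(X_toy.shape)  # → (3, 10)
```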
This embedding has reduced our text data to 1000 features. We will now perform dimensionality reduction so that we can visualize the text data.
We first reduce the dimensionality once before applying HDBSCAN clustering, to speed up the clustering process. We will then further reduce the dimensionality to 2 features for better visualization.
import umap
X_reduced = umap.UMAP(n_neighbors=15,
                      n_components=10,
                      metric='cosine').fit_transform(X_tr)
print(X_reduced.shape)
X_reduced
(486008, 10)
array([[ 9.560419 , 9.0328455, 3.0807197, ..., 5.0644646, 4.073515 ,
9.034298 ],
[ 9.71617 , 8.386079 , 2.5083742, ..., 5.7708173, 5.4900146,
10.477584 ],
[ 9.954774 , 10.604382 , 3.0913274, ..., 4.6353145, 3.0057309,
10.113983 ],
...,
[ 9.354248 , 9.711932 , 3.033503 , ..., 5.6380095, 4.4658704,
8.070437 ],
[10.355504 , 9.899299 , 2.5777116, ..., 4.508413 , 4.393826 ,
9.973897 ],
[10.280046 , 9.876802 , 2.5860057, ..., 4.5620923, 4.2433076,
10.015702 ]], dtype=float32)
This first pass of UMAP has cut the number of features by a factor of 100. We can now find clusters with HDBSCAN to identify topics of interest.
Now that we've reduced the dimensionality to 10, we will apply HDBSCAN to find clusters (or topics) which we will then visualize by reducing the dimensionality further to 2.
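The reduce-then-cluster pattern can be illustrated with lightweight scikit-learn stand-ins (PCA in place of UMAP, DBSCAN in place of HDBSCAN) on synthetic data. This is only an illustrative sketch of the pattern, not the notebook's actual GPU pipeline:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# 300 points in 50 dimensions, grouped around 3 well-separated centers
X_toy, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)
# Reduce to 2 dimensions, then density-cluster the reduced points
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_toy)
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X_2d)
n_clusters = len(set(labels) - {-1})  # -1 is DBSCAN's noise label
print(X_2d.shape, n_clusters)
```

Density-based clustering like this is why the number of clusters does not need to be specified in advance, unlike k-means.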
import cuml
cluster = cuml.HDBSCAN(min_cluster_size=15,
                       metric='euclidean',
                       cluster_selection_method='eom',
                       verbose=True).fit(X_reduced)
print("Clustering completed")
#Visualization, reapplying UMAP
umap_viz = umap.UMAP(n_neighbors=15, n_components=2, verbose=True, metric='cosine').fit_transform(X_tr)
result = pd.DataFrame(umap_viz, columns=['x', 'y'])
result['labels'] = cluster.labels_
# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()
Clustering completed
UMAP(angular_rp_forest=True, metric='cosine', verbose=True)
Mon Oct 10 17:18:00 2022 Construct fuzzy simplicial set
Mon Oct 10 17:18:00 2022 Finding Nearest Neighbors
Mon Oct 10 17:18:00 2022 Building RP forest with 40 trees
Mon Oct 10 17:19:16 2022 metric NN descent for 19 iterations
Stopping threshold met -- exiting after 10 iterations
Mon Oct 10 17:27:28 2022 Finished Nearest Neighbor Search
Mon Oct 10 17:27:30 2022 Construct embedding
Mon Oct 10 17:32:02 2022 Finished embedding
<matplotlib.colorbar.Colorbar at 0x7f270c279af0>
len(result.labels.unique())
976
This method identified 976 cluster labels (including HDBSCAN's noise label, -1), which is quite a high number. It would probably be worth tuning the hyperparameters of our UMAP and HDBSCAN steps to reach a more useful number of clusters.
The UMAP visualization shows some clusters, but there are too many of them to tell whether this method separated them meaningfully.
We will now use the BERTopic package to identify topics of dissatisfaction.
With default settings, it performs embedding with SBERT, dimensionality reduction with UMAP, and clustering with HDBSCAN, the same pipeline we just ran manually with alternate methods.
from bertopic import BERTopic
import hdbscan
%set_env TOKENIZERS_PARALLELISM=True
# -- Custom HDBSCAN
bertopic_params = {}
bertopic_params['hdbscan_model'] = hdbscan.HDBSCAN(min_cluster_size=10,
                                                   metric='euclidean',
                                                   cluster_selection_method='eom',
                                                   prediction_data=True,
                                                   core_dist_n_jobs=1)
topic_model = BERTopic(language="english", verbose=True, **bertopic_params)
topics, probs = topic_model.fit_transform(df["lemma_text"].to_list())
env: TOKENIZERS_PARALLELISM=True
2022-10-12 19:55:17,573 - BERTopic - Transformed documents to Embeddings
2022-10-12 20:01:26,665 - BERTopic - Reduced dimensionality
2022-10-12 20:03:22,968 - BERTopic - Clustered reduced embeddings
#Saving our topic_model
with open('Data/topic_model.pkl', 'wb') as file:
    dill.dump(topic_model, file)
#Saving the topics and probabilities
with open('Data/topics.pkl', 'wb') as file:
    dill.dump(topics, file)
with open('Data/probs.pkl', 'wb') as file:
    dill.dump(probs, file)
with open('Data/topic_model.pkl', 'rb') as file:
    topic_model = dill.load(file)
topic_model.visualize_barchart(top_n_topics=8)
We can see that some identified topics are relevant, like Topics 2 and 7, but others merely group types of food.
We will now perform Sentiment analysis on the words identified in each topic in order to filter out Neutral topics (i.e. based on the type of food) and only select topics including negative sentiment.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
def score_topics_polarity(model, number):
    analyzer = SentimentIntensityAnalyzer()
    scores = []
    for i in range(number):
        word_list = []
        #Using the model argument (the original referenced the global topic_model instead)
        for t in model.get_topic(i):
            word_list.append(t[0])
        words = ' '.join(word_list)
        scores.append({'Number': i, 'Polarity': analyzer.polarity_scores(words)["compound"]})
    return pd.DataFrame(scores)
topic_polarity = score_topics_polarity(topic_model, 50)
topic_polarity.head()
| | Number | Polarity |
|---|---|---|
| 0 | 0 | 0.0000 |
| 1 | 1 | 0.0000 |
| 2 | 2 | -0.0516 |
| 3 | 3 | 0.0000 |
| 4 | 4 | -0.5719 |
As seen here, most topics do not contain any polarized words and thus have a neutral polarity. Other topics, like the previously identified Topic 2, have a negative polarity.
It is those topics we want to select, so let's filter our topics by negativity:
negative_topics = topic_polarity[topic_polarity.Polarity < 0]["Number"].to_list()
topic_model.visualize_barchart(negative_topics, n_words=10)
We can see that most of the identified topics are relevant. For easier reading, we will plot a word cloud for each of these topics:
# import the wordcloud library
from wordcloud import WordCloud
# Instantiate a new wordcloud.
wordcloud = WordCloud(random_state=8,
                      normalize_plurals=False,
                      width=600, height=300,
                      max_words=300,
                      stopwords=[])
# Apply the wordcloud to the text.
def generate_topic_wordclouds(topic_indexes):
    for i in topic_indexes:
        word_dict = {}
        for t in topic_model.get_topic(i):
            word_dict[t[0]] = int(t[1] * 10000)
        wordcloud.generate_from_frequencies(word_dict)
        fig, ax = plt.subplots(1, 1, figsize=(9, 6))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()
generate_topic_wordclouds(negative_topics)
This concludes our topic analysis. The topics are ranked by frequency, so our client can grow its customer base by avoiding these common pitfalls.
Here are the top 3 identified topics of dissatisfaction :
In this part of the project, we will show how our dataset can be updated in production by querying reviews for 200 new businesses through the Yelp API.
import cred
import requests
api_key = cred.api_key
loc_list = ['USA', 'NY', 'LA', 'Washington', 'DC', 'SF', 'Chicago'] #Iterate over different locations to avoid duplicates
categories = 'Restaurants'
attributes = 'hot_and_new' #Retrieve only new businesses to avoid duplicates with "old" dataframe
SEARCH_LIMIT = 50 #Search limit
def retrieve_yelp_reviews(locations=loc_list, n_businesses=200, n_reviews=600):
    biz_url = 'https://api.yelp.com/v3/businesses/search'
    headers = {
        'Authorization': 'Bearer {}'.format(api_key),
    }
    responses = []
    for loc in locations:
        url_params = {
            'location': loc + '+',
            'categories': categories + '+',
            'attributes': attributes + '+',
            'limit': SEARCH_LIMIT
        }
        response = requests.get(biz_url, headers=headers, params=url_params)
        #Checking for a valid response status code
        if response.status_code == 200:
            responses += response.json()['businesses']
        if len(responses) >= n_businesses:
            break
    new_biz = pd.DataFrame.from_dict(responses)
    new_rev = []
    for i in new_biz["id"].to_list():
        url = "https://api.yelp.com/v3/businesses/" + str(i) + "/reviews"
        response = requests.get(url, headers=headers, params=None)
        if response.json():
            new_rev += response.json()['reviews']
    new_rev = pd.DataFrame(new_rev)
    return new_biz, new_rev
new_biz, new_rev = retrieve_yelp_reviews()
new_biz
| | id | alias | name | image_url | is_closed | url | review_count | categories | rating | coordinates | transactions | price | location | phone | display_phone | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wGl_DyNxSv8KUtYgiuLhmA | bi-rite-creamery-san-francisco | Bi-Rite Creamery | https://s3-media3.fl.yelpcdn.com/bphoto/c5-w8m... | False | https://www.yelp.com/biz/bi-rite-creamery-san-... | 9911 | [{'alias': 'icecream', 'title': 'Ice Cream & F... | 4.5 | {'latitude': 37.761591, 'longitude': -122.425717} | [delivery] | $$ | {'address1': '3692 18th St', 'address2': None,... | +14156265600 | (415) 626-5600 | 946.386739 |
| 1 | lJAGnYzku5zSaLnQ_T6_GQ | brendas-french-soul-food-san-francisco-6 | Brenda's French Soul Food | https://s3-media4.fl.yelpcdn.com/bphoto/VJ865E... | False | https://www.yelp.com/biz/brendas-french-soul-f... | 11721 | [{'alias': 'breakfast_brunch', 'title': 'Break... | 4.0 | {'latitude': 37.7829016035273, 'longitude': -1... | [delivery] | $$ | {'address1': '652 Polk St', 'address2': '', 'a... | +14153458100 | (415) 345-8100 | 2885.389131 |
| 2 | WavvLdfdP6g8aZTtbBQHTw | gary-danko-san-francisco | Gary Danko | https://s3-media3.fl.yelpcdn.com/bphoto/eyYUz3... | False | https://www.yelp.com/biz/gary-danko-san-franci... | 5748 | [{'alias': 'newamerican', 'title': 'American (... | 4.5 | {'latitude': 37.80587, 'longitude': -122.42058} | [] | $$$$ | {'address1': '800 N Point St', 'address2': '',... | +14157492060 | (415) 749-2060 | 5191.341803 |
| 3 | ri7UUYmx21AgSpRsf4-9QA | tartine-bakery-san-francisco-3 | Tartine Bakery | https://s3-media4.fl.yelpcdn.com/bphoto/QRbC0T... | False | https://www.yelp.com/biz/tartine-bakery-san-fr... | 8530 | [{'alias': 'bakeries', 'title': 'Bakeries'}, {... | 4.0 | {'latitude': 37.76131, 'longitude': -122.42431} | [delivery] | $$ | {'address1': '600 Guerrero St', 'address2': ''... | +14154872600 | (415) 487-2600 | 1087.638933 |
| 4 | 76smcUUGRvq3k1MVPUXbnA | mitchells-ice-cream-san-francisco | Mitchells Ice Cream | https://s3-media2.fl.yelpcdn.com/bphoto/f4lzrs... | False | https://www.yelp.com/biz/mitchells-ice-cream-s... | 4530 | [{'alias': 'icecream', 'title': 'Ice Cream & F... | 4.5 | {'latitude': 37.744221, 'longitude': -122.422791} | [pickup, delivery] | $ | {'address1': '688 San Jose Ave', 'address2': '... | +14156482300 | (415) 648-2300 | 2209.260424 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | w11bYFeSydqdUpuyEJoXkg | rachel-lake-cle-elum | Rachel Lake | https://s3-media1.fl.yelpcdn.com/bphoto/c2owVq... | False | https://www.yelp.com/biz/rachel-lake-cle-elum?... | 10 | [{'alias': 'hiking', 'title': 'Hiking'}, {'ali... | 3.5 | {'latitude': 47.19518, 'longitude': -120.93829} | [] | NaN | {'address1': '', 'address2': '', 'address3': '... | 9218.085989 | ||
| 196 | BvJEM79soFlapfgHIngnpA | keg-cellar-tavern-cle-elum | Keg Cellar Tavern | https://s3-media1.fl.yelpcdn.com/bphoto/w38HE2... | False | https://www.yelp.com/biz/keg-cellar-tavern-cle... | 6 | [{'alias': 'bars', 'title': 'Bars'}] | 4.0 | {'latitude': 47.1943740844727, 'longitude': -1... | [] | $$ | {'address1': '112 N Pennsylvania Ave', 'addres... | +15096742277 | (509) 674-2277 | 9283.402199 |
| 197 | 4_m5m6ciDSEC7F3la3C_zQ | gravity-coffee-cle-elum-cle-elum | Gravity Coffee - Cle Elum | https://s3-media1.fl.yelpcdn.com/bphoto/3jEjYK... | False | https://www.yelp.com/biz/gravity-coffee-cle-el... | 9 | [{'alias': 'coffee', 'title': 'Coffee & Tea'},... | 4.0 | {'latitude': 47.19498, 'longitude': -120.95508} | [] | NaN | {'address1': '808 West Davis St', 'address2': ... | +12534478740 | (253) 447-8740 | 9886.047475 |
| 198 | QY6Q8bwDfQ6PZagXWvVcvw | kodiak-coffee-roslyn | Kodiak Coffee | https://s3-media3.fl.yelpcdn.com/bphoto/MzvoJo... | False | https://www.yelp.com/biz/kodiak-coffee-roslyn?... | 11 | [{'alias': 'coffee', 'title': 'Coffee & Tea'}] | 3.5 | {'latitude': 47.2078147610546, 'longitude': -1... | [] | NaN | {'address1': '3172 WA-903', 'address2': '', 'a... | +15096493398 | (509) 649-3398 | 10128.139910 |
| 199 | h2SS90lvvuHrupQ7xWVh1Q | 56-degrees-cle-elum | 56 Degrees | https://s3-media2.fl.yelpcdn.com/bphoto/Eikx0k... | False | https://www.yelp.com/biz/56-degrees-cle-elum?a... | 23 | [{'alias': 'newamerican', 'title': 'American (... | 2.5 | {'latitude': 47.2086001553586, 'longitude': -1... | [] | $$ | {'address1': '3600 Suncadia Trl', 'address2': ... | +15096496474 | (509) 649-6474 | 12329.965839 |
200 rows × 16 columns
Above is the list of businesses we collected. We gathered the information on 200 businesses.
Note : It is necessary to include a location in the Yelp API business search. We have included a list of common US cities that we can iterate over but it would be more relevant to focus on the cities located in the vicinity of Good Dinner's Restaurants.
print("Number of reviews collected : {}".format(len(new_rev)))
new_rev.head()
Number of reviews collected : 600
| | id | url | text | rating | time_created | user |
|---|---|---|---|---|---|---|
| 0 | d9min1nLES_aJv-aEoLilQ | https://www.yelp.com/biz/bi-rite-creamery-san-... | Hands down my favorite ice cream spot in San F... | 5 | 2022-09-05 15:24:19 | {'id': 'SXJmAdpip5_vFFHPj7lwJQ', 'profile_url'... |
| 1 | yowp01_Ji6bnU5_OhRgxjA | https://www.yelp.com/biz/bi-rite-creamery-san-... | Really unique flavors -- I LOVED the ritual co... | 4 | 2022-10-02 11:17:03 | {'id': 'Jl1z3Wylzhb25XejTxPHtQ', 'profile_url'... |
| 2 | ljntTVjfL2BQPaiLMPZsGw | https://www.yelp.com/biz/bi-rite-creamery-san-... | We love Bi-Rite Creamery, especially for their... | 4 | 2022-09-23 12:09:34 | {'id': 'tctoFsg9byYvQ7OhdfzrvQ', 'profile_url'... |
| 3 | 2fji5yUnZHTW-lZzoqaEqA | https://www.yelp.com/biz/brendas-french-soul-f... | Ah Brenda... you did not disappoint. \nFried C... | 5 | 2022-10-09 09:33:09 | {'id': 'vdyw_IXcFpfj6RrpUesPgw', 'profile_url'... |
| 4 | KfHGYGVl5PQqLTt4GgZB1w | https://www.yelp.com/biz/brendas-french-soul-f... | Service was horrible from our server, others a... | 3 | 2022-10-10 05:28:31 | {'id': 'WXoYGyHE5UrSC0YOd5e-7w', 'profile_url'... |
Another limitation of the Yelp API is that only 3 reviews per business are retrievable. The reviews recovered are randomized so it would be possible to rerun the queries several times and gather different reviews.
Above is the list of the 600 recovered reviews.
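Since each query returns a random subset of 3 reviews per business, successive runs can be accumulated and de-duplicated on the review `id`. A sketch with hypothetical mini frames:

```python
import pandas as pd

# Hypothetical results of two successive API runs
run1 = pd.DataFrame({"id": ["a", "b", "c"], "text": ["...", "...", "..."]})
run2 = pd.DataFrame({"id": ["b", "d"], "text": ["...", "..."]})
# Stack the runs and keep the first occurrence of each review id
all_reviews = (pd.concat([run1, run2], ignore_index=True)
                 .drop_duplicates(subset="id")
                 .reset_index(drop=True))
print(all_reviews["id"].tolist())  # → ['a', 'b', 'c', 'd']
```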
Now it would be interesting to keep only the negative reviews, i.e. to apply the same filtering as in our preprocessing (keeping only reviews with 1 or 2 stars and negative polarity).
new_rev["polarity"] = new_rev["text"].apply(analyze_polarity)
new_rev["lemma_text"] = new_rev["text"].apply(lemmatize)
#Keeping only negative reviews
new_df = new_rev.loc[(new_rev.polarity < 0) & (new_rev.rating <= 2)].copy()
#Updating the topics with the new data
#topic_model.update_topics(new_df["lemma_text"].to_list())
print(new_df.shape)
#Only 14 negative reviews out of the 600 collected!
new_df.head()
(14, 8)
| | id | url | text | rating | time_created | user | polarity | lemma_text |
|---|---|---|---|---|---|---|---|---|
| 100 | tQN6XoPK3zXheBirOUYP_w | https://www.yelp.com/biz/limoncello-san-franci... | The owner of this business is absolutely insan... | 1 | 2022-09-05 13:55:32 | {'id': 'AiCTjZiyaZ8bx-j_mqzdbg', 'profile_url'... | -0.201133 | owner business absolutely insane clue customer... |
| 106 | bbwo1l3OZCNoQkpABvj8EA | https://www.yelp.com/biz/arizmendi-bakery-san-... | I was craving some sweets and I went to this s... | 2 | 2022-10-06 17:11:44 | {'id': '5g969cG9I994x6OLGiP0SQ', 'profile_url'... | -0.003343 | crave sweet go store \n ask chocolate thing \n... |
| 196 | si-G-TWKkyCO4LvTSWRgsA | https://www.yelp.com/biz/peter-luger-brooklyn-... | [A racist establishment- do not frequent] \nI ... | 1 | 2022-10-04 20:37:18 | {'id': 'AMQofGG8AmqE6BW79gqFbQ', 'profile_url'... | -0.153680 | racist establishment- frequent \n rarely write... |
| 352 | 41A5RZNPQ-MIdCGLWg1Llg | https://www.yelp.com/biz/hae-jang-chon-los-ang... | good food but bad service,\nwe felt rushed and... | 2 | 2022-10-06 15:24:36 | {'id': 'I5owhnCBGBsEX1FucVli6g', 'profile_url'... | -0.160700 | good food bad service \n feel rush want contro... |
| 407 | PB4F4pBQnDZT9dmLEgSC5A | https://www.yelp.com/biz/daves-hot-chicken-los... | From hero's to franchise owners. \nQuality at ... | 2 | 2022-09-04 16:19:34 | {'id': 'kYWig-IQj7noJ7F3-h1G9A', 'profile_url'... | -0.014557 | hero franchise owner \n quality location drast... |
Out of these 600 reviews, only 14 are negative, so we will need to run a lot of API queries to be able to significantly impact our BERTopic algorithm.
We can now save the complete list of retrieved businesses and reviews for further use.
new_rev.to_csv("new_rev.csv")
new_biz.to_csv("new_biz.csv")
Now that we've completed Yelp review data processing, we will investigate the Photo database.
Our main goal is to classify pictures based on the Yelp labels.
5 labels have been identified: food, drink, inside, outside, menu.
We will first load the image paths so that we can retrieve them later on.
from skimage import io
import os
import glob
data_path = "Data/Photos"
#Retrieving photo names
photos_path = os.path.join(data_path, '*')
photos_path = glob.glob(photos_path)
Let's visualize a random image :
#Sampling a random image
image = io.imread(photos_path[5])
#Plotting the image
fig, ax = plt.subplots(1)
fig.set_figwidth(15)
ax.imshow(image)
plt.grid(None)
plt.show()
It works! Now we can turn our photo.json file into a DataFrame so that we can retrieve information about these photos, most importantly their labels.
photos = pd.read_json("Data/photos.json", lines=True)
photos.head()
| photo_id | business_id | caption | label | |
|---|---|---|---|---|
| 0 | zsvj7vloL4L5jhYyPIuVwg | Nk-SJhPlDBkAZvfsADtccA | Nice rock artwork everywhere and craploads of ... | inside |
| 1 | HCUdRJHHm_e0OCTlZetGLg | yVZtL5MmrpiivyCIrVkGgA | outside | |
| 2 | vkr8T0scuJmGVvN2HJelEA | _ab50qdWOk0DdB6XOrBitw | oyster shooter | drink |
| 3 | pve7D6NUrafHW3EAORubyw | SZU9c8V2GuREDN5KgyHFJw | Shrimp scampi | food |
| 4 | H52Er-uBg6rNrHcReWTD2w | Gzur0f0XMkrVxIwYJvOt2g | food |
We will also add the path to our photos in the dataframe :
def add_path(x):
return "Data/Photos/" + str(x)+".jpg"
photos["photo_path"] = photos["photo_id"].apply(add_path)
photos.head()
| photo_id | business_id | caption | label | photo_path | |
|---|---|---|---|---|---|
| 0 | zsvj7vloL4L5jhYyPIuVwg | Nk-SJhPlDBkAZvfsADtccA | Nice rock artwork everywhere and craploads of ... | inside | Data/Photos/zsvj7vloL4L5jhYyPIuVwg.jpg |
| 1 | HCUdRJHHm_e0OCTlZetGLg | yVZtL5MmrpiivyCIrVkGgA | outside | Data/Photos/HCUdRJHHm_e0OCTlZetGLg.jpg | |
| 2 | vkr8T0scuJmGVvN2HJelEA | _ab50qdWOk0DdB6XOrBitw | oyster shooter | drink | Data/Photos/vkr8T0scuJmGVvN2HJelEA.jpg |
| 3 | pve7D6NUrafHW3EAORubyw | SZU9c8V2GuREDN5KgyHFJw | Shrimp scampi | food | Data/Photos/pve7D6NUrafHW3EAORubyw.jpg |
| 4 | H52Er-uBg6rNrHcReWTD2w | Gzur0f0XMkrVxIwYJvOt2g | food | Data/Photos/H52Er-uBg6rNrHcReWTD2w.jpg |
Now, just as with the reviews, we are only interested in photos from restaurants, so we will filter our photo database using the business list identified previously.
#Joining on our restaurant business table to keep only restaurant photos
print(len(photos))
photos = pd.merge(photos, business, on="business_id", how="inner")
print(len(photos))
199994 170484
This reduces the size of our database from about 200k samples to 170k samples (15% reduction).
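The inner merge above keeps a photo only if its business_id appears in the restaurant table. A variant worth noting (not what the notebook does) is merging on just the key column, which avoids copying every business attribute into the photo table. A toy sketch with hypothetical values:

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
photos_toy = pd.DataFrame({
    "photo_id": ["p1", "p2", "p3"],
    "business_id": ["b1", "b2", "b9"],   # "b9" has no match below
})
business_toy = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["A", "B", "C"],
})

# how="inner" keeps only rows whose business_id appears in both tables;
# selecting just the key column avoids duplicating business attributes
filtered = pd.merge(photos_toy, business_toy[["business_id"]],
                    on="business_id", how="inner")
print(len(filtered))  # 2
```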
We also need to check the validity of our photo files: every photo referenced in our database should actually be present and readable in the source folder. We will use OpenCV to read each image and check whether it is corrupted.
After that, we will remove corrupted images from our database.
#After running some iterations, we have realized that some images are corrupted
#This function will verify the path of each image using cv2
import cv2
def verify_path(x):
img = cv2.imread(x)
if img is None:
return np.nan
else:
return x
photos["photo_path"] = photos["photo_path"].apply(verify_path)
libpng warning: iCCP: known incorrect sRGB profile
photos = photos[photos.photo_path.notna()]
#5 occurrences were removed
import dill
with open('Data/photos.pkl', 'wb') as file:
dill.dump(photos, file)
This has removed 5 corrupted or missing photos from our database.
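A lighter-weight alternative (assuming Pillow is installed; `verify_path_pil` is a hypothetical helper, not part of the notebook) checks file integrity without fully decoding each image:

```python
import tempfile, os
import numpy as np
from PIL import Image

def verify_path_pil(path):
    """Return the path if the file opens as a valid image, else NaN.

    Image.verify() checks the file's integrity without decoding all
    pixel data, so it is typically cheaper than a full cv2.imread.
    """
    try:
        with Image.open(path) as img:
            img.verify()
        return path
    except OSError:
        return np.nan

# Quick demonstration on a valid and a corrupted file
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.jpg")
Image.new("RGB", (4, 4)).save(good)
bad = os.path.join(tmpdir, "bad.jpg")
with open(bad, "wb") as f:
    f.write(b"not an image")

print(verify_path_pil(good) == good)   # True
print(verify_path_pil(bad))            # nan
```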
Now that our photo database has been cleaned, we are ready to start preprocessing.
We will now apply preprocessing steps to our images. Instead of applying these functions to all our images, which would take a long time, we apply them to a single batch first.
We will then implement the same preprocessing with TensorFlow and Keras to automate it at scale.
Greyscaling can be a useful preprocessing step: it reduces the number of channels from 3 to 1, which shrinks the image size and smooths out results. Of course, the algorithm then loses the ability to distinguish colors, which can be a problem.
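Under the hood, greyscale conversion is just a weighted sum over the colour channels; a minimal numpy sketch of what rgb2gray computes (`to_grayscale` is a hypothetical helper using skimage's luminance coefficients):

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted channel sum using the standard luminance coefficients
    (the same ones skimage's and cucim's rgb2gray apply)."""
    weights = np.array([0.2125, 0.7154, 0.0721])
    return rgb @ weights  # matmul contracts the last (channel) axis

rgb = np.random.rand(8, 8, 3)  # dummy image
grey = to_grayscale(rgb)
print(grey.shape)  # (8, 8)
```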
#Testing manual greyscaling
from cucim.skimage.color import rgb2gray
import cupy as cp
img_grey = rgb2gray(cp.array(image))
plt.imshow(cp.asnumpy(img_grey))
<matplotlib.image.AxesImage at 0x7f28b3695130>
We will now use the cupy package to perform a basic normalization of our image :
#Image normalization with cupy
img_norm = (img_grey - cp.min(img_grey))/ (cp.max(img_grey) - cp.min(img_grey))
plt.imshow(cp.asnumpy(img_norm))
<matplotlib.image.AxesImage at 0x7f28b3f47fd0>
We will now switch from basic preprocessing steps to Tensorflow and Keras in order to reduce RAM usage and to be able to reproduce the preprocessing steps at scale more easily.
The first thing we need to do is load our database into TensorFlow.
We will perform greyscale conversion when loading the images in order to reduce the overhead. The load function also automatically reduces the image to the desired scale, here 160x160.
#Using tensorflow and keras
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds
BATCH_SIZE = 32
IMG_SIZE = 160
test = photos[photos.label.notna()].head(500) #Creating a sample of our dataset to test deployment
x = test["photo_path"]
y = test["label"]
def load(file_path):
img = tf.io.read_file(file_path)
img = tf.image.decode_png(img, channels=3)
img = tf.image.convert_image_dtype(img, tf.float32)
img = tf.image.resize(img, size=(IMG_SIZE, IMG_SIZE)) #Resize
img = tf.image.rgb_to_grayscale(img) #Converts to greyscale
return img
photo_ds = tf.data.Dataset.from_tensor_slices((x,y)).map(lambda x,y : (load(x), y))
next(iter(photo_ds))
['inside' 'food' 'drink' 'outside' 'menu']
(<tf.Tensor: shape=(160, 160, 1), dtype=float32, numpy=
array([[[0.08659334],
[0.08659334],
[0.08659334],
...,
[0.255501 ],
[0.25374258],
[0.26273412]],
[[0.08659334],
[0.08659334],
[0.08659334],
...,
[0.26046377],
[0.26022068],
[0.2752608 ]],
[[0.08571423],
[0.08571423],
[0.08571423],
...,
[0.26663992],
[0.2686267 ],
[0.27975687]],
...,
[[0.27996668],
[0.3602888 ],
[0.35526276],
...,
[0.02429049],
[0.03628377],
[0.03729754]],
[[0.3682305 ],
[0.4474603 ],
[0.58711964],
...,
[0.08658904],
[0.06012097],
[0.0773527 ]],
[[0.39774716],
[0.3790933 ],
[0.35402676],
...,
[0.05249134],
[0.06983256],
[0.07253755]]], dtype=float32)>,
<tf.Tensor: shape=(), dtype=string, numpy=b'inside'>)
AUTOTUNE = tf.data.AUTOTUNE
def configure_for_performance(ds):
ds = ds.cache()
ds = ds.shuffle(buffer_size=1000)
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
return ds
photo_ds = configure_for_performance(photo_ds)
Now let's verify that the photos are correctly loaded into the TensorFlow dataset:
image_batch, label_batch = next(iter(photo_ds))
plt.figure(figsize=(10, 10))
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(image_batch[i].numpy())
plt.title(label_batch[i].numpy().decode('UTF-8'))
plt.axis("off")
Now that our photos are correctly loaded, let's continue performing some preprocessing steps.
We will now standardize our images by Rescaling them.
Let's look at our results :
#Standardizing our images
normalization_layer = tf.keras.layers.Rescaling(1./255)
normalized_ds = photo_ds.map(lambda x,y : (normalization_layer(x), y))
image_batch, label_batch = next(iter(normalized_ds))
#Visualizing 9 standardized images
plt.figure(figsize=(10, 10))
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(image_batch[i].numpy())
plt.title(label_batch[i].numpy().decode('UTF-8'))
plt.axis("off")
This has little visible effect, but it allows our images to be processed by classification or feature-extraction algorithms.
We will now perform histogram equalization of our images.
There is no built-in TensorFlow function to perform equalization, so I had to implement a custom Keras layer.
Let's look at our results :
#Defining custom Keras layers
from skimage import exposure
from tensorflow.keras.layers import Layer, Input, Conv2D
from tensorflow.keras.models import Model
def equalize(img):
for channel in range(img.shape[2]): # equalizing each channel
img[:, :, channel] = exposure.equalize_hist(img[:, :, channel])
return img.astype(np.float32)
def preprocess_input(img):
x = tf.numpy_function(equalize,
[img],
'float32',
name='histogram_equalization')
return tf.cast(x, tf.float32)
class EqualizingLayer(Layer):
def __init__(self, **kwargs):
self.trainable = False
super(EqualizingLayer, self).__init__(**kwargs)
def compute_output_shape(self, input_shape):
return ((input_shape[0], input_shape[1], input_shape[2], input_shape[3]))
def build(self, input_shape):
super().build(input_shape)
def call(self, x):
res = tf.map_fn(preprocess_input, x)
res.set_shape(self.compute_output_shape(x.get_shape())) #No change to the shape
return res
equalizing_layer = EqualizingLayer()
equalized_ds = normalized_ds.map(lambda x,y : (equalizing_layer(x), y))
image_batch, label_batch = next(iter(equalized_ds))
#Visualizing 9 equalized images
plt.figure(figsize=(10, 10))
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(image_batch[i].numpy())
plt.title(label_batch[i].numpy().decode('UTF-8'))
plt.axis("off")
It is clear that histogram equalization has improved our images: the brightness has been smoothed out, revealing parts of the photos that were previously too dark.
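For intuition, histogram equalization maps each intensity to its value under the cumulative distribution, flattening the histogram; a minimal numpy sketch of what equalize_hist does (`equalize_np` is a hypothetical helper, not skimage's implementation):

```python
import numpy as np

def equalize_np(img, nbins=256):
    """Map pixel intensities to their CDF values so the output
    intensities end up approximately uniformly distributed."""
    hist, bin_edges = np.histogram(img.ravel(), bins=nbins)
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]                                 # normalize CDF to [0, 1]
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    return np.interp(img.ravel(), centers, cdf).reshape(img.shape)

rng = np.random.default_rng(0)
img = rng.random((32, 32)) ** 3     # skewed, mostly-dark dummy image
eq = equalize_np(img)
# Equalization pushes the skewed distribution toward uniform,
# raising the mean brightness of this dark image
print(img.mean() < eq.mean())  # True
```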
To improve our algorithms, we will now perform data augmentation on our images by applying random operations and adding Gaussian noise. The main use of data augmentation is to prevent overfitting. It can also create new samples if the initial database is too small, which is not the case here.
We will perform 4 data augmentation operations here: random flips, random rotations, random zooms, and Gaussian noise.
#Performing data augmentation
from tensorflow.keras import layers
data_augmentation = tf.keras.Sequential([
layers.RandomFlip("horizontal_and_vertical"),
layers.RandomRotation(0.2),
layers.RandomZoom(0.1),
layers.GaussianNoise(1e-4),
])
augmented_ds = equalized_ds.map(lambda x,y: (data_augmentation(x, training=True), y))
image_batch, label_batch = next(iter(augmented_ds))
#Visualizing 9 augmented images
plt.figure(figsize=(10, 10))
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(image_batch[i].numpy())
plt.title(label_batch[i].numpy().decode('UTF-8'))
plt.axis("off")
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
We can clearly see that our data augmentation steps have worked as some images have been flipped.
#Summarizing our preprocessing steps:
pre_processing = tf.keras.Sequential([
layers.Rescaling(1./255),
EqualizingLayer(),
layers.GaussianNoise(1e-4),
layers.RandomFlip("horizontal_and_vertical"),
layers.RandomRotation(0.2),
layers.RandomZoom(0.1),
])
#Converting to greyscale and correcting image_size has been done on dataset generation
Now that we have completed preprocessing, we are still unable to easily visualize our image database because each image is still a 160 × 160 × 3 tensor.
In order to perform feature extraction, we will use Transfer Learning and use the Google MobileNetV2 algorithm to extract important features from our images.
We will now load the full dataset to Tensorflow :
with open('Data/photos.pkl', 'rb') as file:
photos = dill.load(file)
#Using tensorflow and keras
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds
from sklearn.model_selection import train_test_split
from category_encoders.ordinal import OrdinalEncoder
TEST_SIZE = 0.1
df = photos[photos.label.notna()][["photo_path","label"]] #Loading full dataset with no missing labels
#Defining label mapping for further use
label_mapping = [{'col': 'label', 'mapping':{'food': 0, 'drink':1, 'inside':2, 'outside': 3, 'menu': 4}}]
df = OrdinalEncoder(cols="label", mapping=label_mapping).fit_transform(df)
train, test = train_test_split(df, test_size=TEST_SIZE)
train, val = train_test_split(train, test_size =TEST_SIZE/(1-TEST_SIZE))
X_train, X_test, X_val = train["photo_path"], test["photo_path"], val["photo_path"]
y_train, y_test, y_val = train["label"], test["label"], val["label"]
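The second split uses TEST_SIZE/(1-TEST_SIZE) so that the validation set also ends up at 10% of the original data; a quick arithmetic check with a hypothetical dataset size:

```python
TEST_SIZE = 0.1
n = 1000                                 # hypothetical dataset size

n_test = n * TEST_SIZE                   # first split: 100 samples for test
n_remaining = n - n_test                 # 900 samples left
val_frac = TEST_SIZE / (1 - TEST_SIZE)   # 0.111... of the remainder
n_val = n_remaining * val_frac           # 100 samples = 10% of the original

print(int(round(n_val)))  # 100
```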
IMG_SIZE = 160
BATCH_SIZE = 32
def load(file_path):
img = tf.io.read_file(file_path)
img = tf.image.decode_png(img, channels=3)
img = tf.image.convert_image_dtype(img, tf.float32)
img = tf.image.resize(img, size=(IMG_SIZE, IMG_SIZE)) #Resize
#img = tf.image.rgb_to_grayscale(img) #We do not perform grayscale conversion for this model
return img
train_ds = tf.data.Dataset.from_tensor_slices((X_train,y_train)).map(lambda x,y : (load(x), y))
test_ds = tf.data.Dataset.from_tensor_slices((X_test,y_test)).map(lambda x,y : (load(x), y))
val_ds = tf.data.Dataset.from_tensor_slices((X_val,y_val)).map(lambda x,y : (load(x), y))
next(iter(train_ds))
(<tf.Tensor: shape=(160, 160, 3), dtype=float32, numpy=
array([[[0.16470589, 0.16470589, 0.13333334],
[0.16078432, 0.16078432, 0.12941177],
[0.15686275, 0.15686275, 0.1254902 ],
...,
[0.10980393, 0.10196079, 0.05147059],
[0.10196079, 0.10196079, 0.07058824],
[0.10588236, 0.10588236, 0.07450981]],
[[0.16470589, 0.16470589, 0.13333334],
[0.16078432, 0.16078432, 0.12941177],
[0.15686275, 0.15686275, 0.1254902 ],
...,
[0.11078432, 0.10294119, 0.05245098],
[0.10588236, 0.10588236, 0.07450981],
[0.10980393, 0.10980393, 0.07843138]],
[[0.16470589, 0.16470589, 0.13333334],
[0.16078432, 0.16078432, 0.12941177],
[0.15686275, 0.15686275, 0.1254902 ],
...,
[0.11629903, 0.10330883, 0.05539216],
[0.11433824, 0.10551471, 0.07120098],
[0.11752452, 0.10870099, 0.07438726]],
...,
[[0.02267157, 0.01090686, 0. ],
[0.02843137, 0.01666667, 0. ],
[0.03235294, 0.02058824, 0.00098039],
...,
[0.2927696 , 0.31850493, 0.34203434],
[0.2824755 , 0.3139706 , 0.32598042],
[0.37426472, 0.39828435, 0.41004905]],
[[0.03186275, 0.02009804, 0.00110294],
[0.03529412, 0.02352941, 0.00392157],
[0.03235294, 0.02058824, 0.00098039],
...,
[0.2507353 , 0.27169117, 0.28468138],
[0.37622553, 0.40551475, 0.42598042],
[0.38676476, 0.41078436, 0.42254907]],
[[0.02855392, 0.01678922, 0. ],
[0.03394608, 0.02218137, 0.00257353],
[0.03627451, 0.02450981, 0.00490196],
...,
[0.2839461 , 0.30012256, 0.31188726],
[0.41372553, 0.44901964, 0.46862748],
[0.36164218, 0.38566178, 0.3974265 ]]], dtype=float32)>,
<tf.Tensor: shape=(), dtype=int64, numpy=0>)
It is not necessary to perform greyscale conversion for this model, so we will slightly modify our preprocessing pipeline to accommodate this.
We will now load the model and its preprocessing function that will be used in our preprocessing pipeline.
#We will now perform feature extraction using the MobileNetV2 model.
#We have to use the model's preprocessing function, which just adds a Rescaling Layer
preprocess_input = tf.keras.applications.mobilenet_v2.preprocess_input
IMG_SHAPE = (IMG_SIZE,IMG_SIZE) + (3,)
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
include_top=False,
weights='imagenet')
base_model.trainable = False
AUTOTUNE = tf.data.AUTOTUNE
def configure_for_performance(ds):
ds = ds.cache()
ds = ds.batch(BATCH_SIZE)
#ds = ds.prefetch(buffer_size=AUTOTUNE)
return ds
# train_ds = configure_for_performance(train_ds)
# test_ds = configure_for_performance(test_ds)
# val_ds = configure_for_performance(val_ds)
#Defining custom Keras layers
from skimage import exposure
from tensorflow.keras.layers import Layer, Input, Conv2D
from tensorflow.keras.models import Model
def equalize(img):
for channel in range(img.shape[2]): # equalizing each channel
img[:, :, channel] = exposure.equalize_hist(img[:, :, channel])
return img.astype(np.float32)
def equalize_input(img): #Renamed so it doesn't shadow mobilenet_v2's preprocess_input
x = tf.numpy_function(equalize,
[img],
'float32',
name='histogram_equalization')
return tf.cast(x, tf.float32)
class EqualizingLayer(Layer):
def __init__(self, **kwargs):
self.trainable = False
super(EqualizingLayer, self).__init__(**kwargs)
def compute_output_shape(self, input_shape):
return (input_shape[0], input_shape[1], input_shape[2], input_shape[3])
def build(self, input_shape):
super().build(input_shape)
def call(self, x):
res = tf.map_fn(equalize_input, x)
res.set_shape(self.compute_output_shape(x.get_shape())) #No change to the shape
return res
from tensorflow.keras import layers
import warnings
#Defining our preprocessing steps
pre_process = tf.keras.Sequential([
layers.Rescaling(1./127.5, offset=-1), #Our model required preprocessing rescaling
EqualizingLayer(), #Our custom Histogram equalization method
])
data_augmentation = tf.keras.Sequential([
layers.GaussianNoise(1e-4), #Smoothing our image with gaussian noise
layers.RandomFlip("horizontal_and_vertical"),
layers.RandomRotation(0.2),
layers.RandomZoom(0.1),
])
categorical_data_encoding = layers.CategoryEncoding(num_tokens=5, output_mode="one_hot")
def prepare(ds, shuffle=False, augment=False):
# ds = ds.map(lambda x, y: (pre_process(x), y)) #Integrated to model
#ds = ds.map(lambda x, y: (x, categorical_data_encoding(y))) #Performing one hot encoding on target variable
if shuffle:
ds.shuffle(1000)
#Use augmentation only on the training set
# if augment:
# ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y)) #Integrated to model
return ds
with warnings.catch_warnings(): #This code produces many warnings because of a current tensorflow issue with data aug. layers
warnings.simplefilter("ignore")
train_ds = prepare(train_ds, shuffle=True, augment=True)
test_ds = prepare(test_ds)
val_ds = prepare(val_ds)
train_ds = configure_for_performance(train_ds)
test_ds = configure_for_performance(test_ds)
val_ds = configure_for_performance(val_ds)
We automate these preprocessing steps by wrapping them in a function applied to our 3 datasets.
Let's look at the shape of our data after it has been processed by our base model :
image_batch, label_batch = next(iter(train_ds))
feature_batch = base_model(image_batch)
print(feature_batch.shape)
#Our base model converts our feature batch into a 5*5*1280 block of features
2022-10-12 08:18:35.849750: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
(32, 5, 5, 1280)
The output is still multidimensional; to visualize it, we need to flatten each sample into a single feature vector.
We will add a layer called Global Average Pooling 2D that averages out the spatial dimensions, leaving one vector per image.
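GlobalAveragePooling2D simply averages each feature map over its two spatial axes; a numpy sketch on a dummy batch shaped like the MobileNetV2 output:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((32, 5, 5, 1280))  # (batch, height, width, channels)

# Averaging over the spatial axes leaves one 1280-dim vector per image
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (32, 1280)
```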
#To generate predictions, we need to turn our 5 X 5 X 1280 vector into 1280 features
#We will apply a GlobalAveragePooling2D layer to average out these 5 X 5 dimensions into a 2D vector
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
feature_batch_average = global_average_layer(feature_batch)
print(feature_batch_average.shape)
(32, 1280)
We can see that each image now has 1280 features (the 32 is the batch size).
We will automate this process by regenerating a model and applying it to our training set.
#We will modify our model to add the GlobalPooling Layer
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
base_model.trainable = False
inputs = tf.keras.Input(shape=(IMG_SIZE,IMG_SIZE,3))
x = base_model(inputs, training=False)
outputs = global_average_layer(x)
flatten_model = tf.keras.Model(inputs, outputs)
flatten_model.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 160, 160, 3)] 0
mobilenetv2_1.00_160 (Funct (None, 5, 5, 1280) 2257984
ional)
global_average_pooling2d (G (None, 1280) 0
lobalAveragePooling2D)
=================================================================
Total params: 2,257,984
Trainable params: 0
Non-trainable params: 2,257,984
_________________________________________________________________
train_ds_red = flatten_model.predict(train_ds)
train_ds_red.shape
2022-10-13 06:52:22.144458: W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect sRGB profile
4263/4263 [==============================] - 1027s 241ms/step
(136386, 1280)
with open("Data/train_reduced.pkl", "wb") as file:
dill.dump(train_ds_red, file)
with open("Data/train_reduced.pkl", "rb") as file:
train_ds_red = dill.load(file)
We can see that this worked: the training dataset has been transformed into a 136386 × 1280 numpy array.
We now need to recover the labels from the TensorFlow dataset:
train_labels = np.concatenate([y for x, y in train_ds], axis=0)
train_labels.shape
2022-10-13 13:50:27.955572: W tensorflow/core/lib/png/png_io.cc:88] PNG warning: iCCP: known incorrect sRGB profile
(136386,)
with open("Data/train_labels.pkl", "wb") as file:
dill.dump(train_labels, file)
with open("Data/train_labels.pkl", "rb") as file:
train_labels = dill.load(file)
Now that we have recovered the labels, note that they were already ordinal-encoded when we built the dataset, so no further label encoding is needed.
These integer labels will be used to color the cluster visualization.
# from sklearn.preprocessing import LabelEncoder
# train_labels = LabelEncoder().fit_transform(train_labels)
# train_labels[1]
We are now ready to apply dimensionality reduction to our dataset.
We will apply UMAP to reduce our dataset to 3 components and visualize it:
import umap.umap_ as umap
umap_viz = umap.UMAP(n_neighbors=15, n_components=3, verbose=True, metric='cosine').fit_transform(train_ds_red)
result = pd.DataFrame(umap_viz, columns=['x', 'y','z'])
result['labels'] = train_labels
UMAP(angular_rp_forest=True, metric='cosine', n_components=3, verbose=True) Thu Oct 20 14:13:47 2022 Construct fuzzy simplicial set Thu Oct 20 14:13:48 2022 Finding Nearest Neighbors Thu Oct 20 14:13:48 2022 Building RP forest with 23 trees Thu Oct 20 14:13:55 2022 NN descent for 17 iterations 1 / 17 2 / 17 3 / 17 4 / 17 5 / 17 6 / 17 7 / 17 8 / 17 Stopping threshold met -- exiting after 8 iterations Thu Oct 20 14:14:11 2022 Finished Nearest Neighbor Search Thu Oct 20 14:14:13 2022 Construct embedding
Epochs completed: 0%| 0/200 [00:00]
Thu Oct 20 14:14:57 2022 Finished embedding
import plotly.graph_objects as go
from plotly.subplots import make_subplots
#Retrieving label
label_mapping = [{'col': 'label', 'mapping':{'food': 0, 'drink':1, 'inside':2, 'outside': 3, 'menu': 4}}]
mapping = label_mapping[0]['mapping']
fig = make_subplots()
fig.update_layout(height=600, width=1000)
colors = {0: 'blue', 1: 'red', 2: 'green', 3: 'yellow', 4: 'brown'}
for i in range(5):
    subset = result[result.labels == i]
    fig.add_trace(go.Scatter3d(
        x=subset['x'],
        y=subset['y'],
        z=subset['z'],
        mode='markers',
        name=[k for k, v in mapping.items() if v == i][0].capitalize(),
        marker=dict(
            size=3,
            color=colors[i],
            opacity=0.7
        )
    ))
fig.update_layout(title="3D Visualization of our Image Data",
                  legend_title='Label',
                  font=dict(size=18),
                  showlegend=True)
fig.show()
# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
plt.scatter(result.x, result.y, c=result.labels, s=0.9,cmap='hsv_r')
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x7fa9d8f3ca00>
We can see that this process has not cleanly separated the clusters. A possible reason is that reducing a 1,280-feature dataset to just 3 features discards too much information.
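Rather than judging separation only by eye, we could quantify it with a silhouette score on the reduced embedding. A minimal sketch on toy 2-D blobs standing in for the UMAP output (scikit-learn is assumed to be available):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two tight, far-apart blobs vs. two heavily overlapping ones
separated = np.vstack([rng.normal(0, 0.1, (50, 2)),
                       rng.normal(5, 0.1, (50, 2))])
overlapping = np.vstack([rng.normal(0, 2.0, (50, 2)),
                         rng.normal(0.5, 2.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

# Scores near 1 indicate tight, well-separated clusters;
# scores near 0 indicate overlap, like what we observe in our plot
print(silhouette_score(separated, labels))
print(silhouette_score(overlapping, labels))
```

Applied to `umap_viz` and `train_labels`, this would give a single number to compare preprocessing variants against each other.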
We will now use our MobileNetV2 model to predict the labels of our photos.
First, we need to add layers to the model so that it can actually predict our five classes.
global_average_layer = tf.keras.layers.GlobalAveragePooling2D()
prediction_layer = tf.keras.Sequential([
    layers.Dense(1024, input_dim=1280),
    layers.LeakyReLU(),
    layers.Dense(512),
    layers.LeakyReLU(),
    layers.Dense(256),
    layers.LeakyReLU(),
    layers.Dropout(.3),
    layers.Dense(128),
    layers.LeakyReLU(),
    #layers.Dropout(.2),
    layers.Dense(5),
    layers.Softmax()
])
inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
x = pre_process(inputs)
x = data_augmentation(x)
x = base_model(x, training=False)
x = global_average_layer(x)
outputs = prediction_layer(x)
model = tf.keras.Model(inputs, outputs)
Then we compile our new model and display its summary:
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Categorical cross-entropy is used because it is the standard loss function
# for multi-class classification problems with two or more output labels.
# The Adam optimizer is chosen for its performance; other optimizers such as
# SGD could also be used depending on the model.
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) [(None, 160, 160, 3)] 0
sequential_3 (Sequential) (None, 160, 160, 3) 0
sequential_4 (Sequential) (None, 160, 160, 3) 0
mobilenetv2_1.00_160 (Funct (None, 5, 5, 1280) 2257984
ional)
global_average_pooling2d_1 (None, 1280) 0
(GlobalAveragePooling2D)
sequential_5 (Sequential) (None, 5) 2001413
=================================================================
Total params: 4,259,397
Trainable params: 2,001,413
Non-trainable params: 2,257,984
_________________________________________________________________
Now we evaluate our model on the validation dataset to see its initial results:
loss0, accuracy0 = model.evaluate(val_ds)
533/533 [==============================] - 97s 179ms/step - loss: 1.5922 - accuracy: 0.2328
print("initial loss: {:.2f}".format(loss0))
print("initial accuracy: {:.2f}".format(accuracy0))
initial loss: 1.59 initial accuracy: 0.23
We can see that our untrained model has a very weak accuracy.
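For context, 0.23 is barely above the chance baseline one would get by always guessing a single class, assuming the five classes are roughly balanced:

```python
# With 5 roughly balanced classes, always guessing one class yields 1/5 accuracy
n_classes = 5
chance_accuracy = 1 / n_classes
print(chance_accuracy)  # 0.2
```

So the randomly initialized classification head has learned essentially nothing yet, which is expected before training.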
Let's now train our model and see how these results evolve :
For this phase, we freeze the base_model's weights by setting trainable = False, so we actually fit only the newly created layers.
Because of my computer's limitations, it is only possible to train the model for 5 epochs (and on half the training dataset) before running into RAM issues.
To scale this further, we could run the training on Azure.
initial_epochs = 5
import gc

#Defining a custom callback to prevent memory leaks
class MyCustomCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.enable()
        tf.keras.backend.clear_session() #Resets RAM usage after every epoch
        gc.collect()

mem_clear = MyCustomCallback()
with tf.device('/cpu:0'): #GPU slows down calculation time
    history = model.fit(train_ds, #No validation set is provided to prevent memory leak
                        epochs=initial_epochs,
                        callbacks=[mem_clear])
Epoch 1/5
WARNING:tensorflow:Using a while_loop for converting RngReadAndSkip cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting Bitcast cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting StatelessRandomUniformV2 cause there is no registered converter for this op.
WARNING:tensorflow:Using a while_loop for converting ImageProjectiveTransformV3 cause there is no registered converter for this op.
4263/4263 [==============================] - 1148s 268ms/step - loss: 0.3441 - accuracy: 0.8816
Epoch 2/5
4263/4263 [==============================] - 1125s 264ms/step - loss: 0.2973 - accuracy: 0.8974
Epoch 3/5
4263/4263 [==============================] - 1126s 264ms/step - loss: 0.2792 - accuracy: 0.9033
Epoch 4/5
4263/4263 [==============================] - 1110s 260ms/step - loss: 0.2704 - accuracy: 0.9060
Epoch 5/5
271/4263 [>.............................] - ETA: 17:21 - loss: 0.2794 - accuracy: 0.9057
loss0, accuracy0 = model.evaluate(test_ds)
print("Final loss: {:.2f}".format(loss0))
print("Final accuracy: {:.2f}".format(accuracy0))
404/533 [=====================>........] - ETA: 25s - loss: 0.1852 - accuracy: 0.9404
Our final accuracy of 0.94 is very good for only 5 epochs of training.
We could push it higher by fine-tuning: unfreezing some of the layers of our base model and fitting them to our dataset as well.
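That fine-tuning step could be sketched as follows, along the lines of the usual Keras transfer-learning recipe. The cut-off index and learning rate are assumptions rather than tuned values, and `weights=None` is used here only to avoid downloading the pretrained weights for this illustration:

```python
import tensorflow as tf

# Hypothetical stand-in for the notebook's base model; in the notebook itself
# the existing base_model (with its pretrained weights) would be reused
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights=None)

# Unfreeze only the top of the base model; early layers encode generic
# features and stay frozen (the cut-off index 100 is an assumption)
base_model.trainable = True
fine_tune_at = 100
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

# Recompile with a much lower learning rate than the initial phase,
# so the pretrained features are only gently adjusted
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"])
```

After this recompilation, a few additional epochs of `model.fit` on the same dataset would update both the new head and the unfrozen top of the backbone.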